library(atus)
library(dplyr)
library(rpart)
library(tidyr)
## Warning: package 'tidyr' was built under R version 4.1.2
library(ggplot2)
library(RColorBrewer)
library(rpart)
library(rpart.plot)
## Warning: package 'rpart.plot' was built under R version 4.1.3
library(splines)

Refined Goals of Analysis

Our group is using the atus database to investigate whether various demographic variables correlate to family income. Specifically, we will be looking at region, state, educational attainment, home ownership (rent vs. own), household size, race, and time usage. After visualizing the data, we will test the correlations between each of these demographic variables and income using linear regression or other related statistical analyses. Using the results from these statistical analyses, we will determine relationships between the demographic variables and how they collectively impact family income and how income potentially affects time usage. This information could be useful to better inform policy regarding issues relevant to specific demographics.

Exploring the raltionship between home type and family income

library(tidyr)
library(dplyr)
library(atus)
data("atuscps")
tibble1 <- as_tibble(atuscps) %>%
  na.omit()
library(ggplot2)
library(RColorBrewer)
tibble1 <- as_tibble(atuscps) %>%
  na.omit()
plotcps <- ggplot(data=tibble1) +
  geom_bar(mapping=aes(x=famincome, fill=home_type)) +
  xlab("Family Income") +
  ylab("Count") +
  scale_color_brewer("Set1")
plotcps + theme(axis.text.x = element_text(angle = 27, size = 6)) 

library(ggplot2)
library(RColorBrewer)
plotdps <- ggplot(data=tibble1) +
  geom_bar(mapping=aes(x=home_type, fill=famincome)) + 
  xlab("Home Type") +
  ylab("Count") +
  scale_color_brewer("Set1")
plotdps + theme(axis.text.x = element_text(angle = 27, size = 6))

VISUALIZATION ANALYSIS:

Color-coded simple stacked bar charts, using ggplot2, were chosen to visualize the data because this method is useful in comparing the total amounts across each segmented group. Additionally, stacked bar charts have the advantages of being able to easily view the data and identify potential outliers.

When considering the relationship between the type of home (own, rent, occupy without rent) and family income, the two stacked bar charts visualize the following. No matter what family income range one falls under, there seems to be worth and demand in owning a house. This means even within a humble, lower family income range, there are still home-owners present. This offers the potentially interesting finding that people of lower socioeconomic classes are still trying to chase “the American Dream” or stability, despite renting being a more feasible option for their financial situation. It must be said of course, that this idea is not generalizable to that socioeconomic class as a whole because as both stacked bar charts suggest visually, for lower socioeconomic classes, the renter population still remains more popular than the homeowner population. On the contrary, we see that in the middle to higher income brackets, there are certainly a much larger count of homeowners compared to its opposing category, renters.

library(atus)
chart = table(atuscps$famincome, atuscps$home_type) 
chart
##                
##                   own  rent occupy without rent
##   < 5000         1291  2718                 249
##   5000-7499      1171  2196                 102
##   7500-9999      1614  2666                 132
##   10000-12499    2588  3097                 156
##   12500-14999    2614  2492                 108
##   15000-19999    4451  3766                 154
##   20000-24999    5771  4086                 157
##   25000-29999    6269  3783                 170
##   30000-34999    6850  3484                 117
##   35000-39999    6547  2747                 134
##   40000-49999   11020  3734                 156
##   50000-59999   11462  2755                 106
##   60000-74999   14442  2621                  87
##   75000-99999   20041  2360                  86
##   100000-149999 14486  1212                  52
##   150000+       10483   698                  20
chisq.test(chart)
## 
##  Pearson's Chi-squared test
## 
## data:  chart
## X-squared = 26953, df = 30, p-value < 2.2e-16

RELATIONSHIP SIGNIFICANCE ANALYSIS (Pearson’s Chi-Squared Test):

As the p-value (<2.2e-16) is less than the .05 significance level, and the chi-squared value is notably large (26953), we reject the null hypothesis and conclude that the two variables, family income and home type (own, rent, occupy without rent), are in fact dependent and have a significant relationship.

Exploring the relationship between household size and Income

library(tidyr)
library(dplyr)
library(ggplot2)
library(RColorBrewer)
library(atus)
atus2 <- cbind(atuscps, atusresp) 
atusresp$hh_size <- as.character(as.numeric(atusresp$hh_size))
atus3 <- atus2 %>%
  select(famincome, hh_size)
atus4 <- as_tibble(atus3)
chart1 <- ggplot(data=atus4) +
  geom_bar(mapping=aes(x=famincome, fill=hh_size)) + 
  xlab("Family Income") +
  ylab("Count") +
  scale_color_brewer("Set1") 
chart1 + theme(axis.text.x = element_text(angle = 27, size = 6))

library(tidyr)
library(dplyr)
library(ggplot2)
library(RColorBrewer)
library(atus)
atus2 <- cbind(atuscps, atusresp) 
atusresp$hh_size <- as.character(as.numeric(atusresp$hh_size))
atus3 <- atus2 %>%
  select(famincome, hh_size)
atus4 <- as_tibble(atus3)
chart14 <- ggplot(data=atus4) +
  geom_bar(mapping=aes(x=hh_size, fill=famincome)) + 
  xlab("Household Size") +
  ylab("Count") +
  scale_color_brewer("Set1") 
chart14 + theme(axis.text.x = element_text(angle = 27, size = 6))

VISUALIZATION ANALYSIS:

The stacked bar charts depicting the relationship between household size (1-16) and family income suggest that there is a varied and potentially similar distribution of household size across each family income range. Further, in taking a closer look at the distribution of family income across household sizes, it is clear that there is a varied distribution, with no income group particularly prominent in each household size. Hence, we must use statistical analyses to see if household size and family income have a significant relationship.

library(atus)
chart2 = table(atuscps$famincome, atusresp$hh_size) 
chart2
##                
##                    1   10   11   12   13   14   15   16    2    3    4    5
##   < 5000        1763    2    0    0    0    0    0    0 1010  654  478  217
##   5000-7499     1684    3    0    1    0    0    0    0  736  446  336  169
##   7500-9999     2449    2    0    1    1    0    0    0  858  528  311  169
##   10000-12499   2879    4    3    1    0    0    0    0 1302  707  516  234
##   12500-14999   2376    7    2    0    2    0    0    0 1210  653  509  284
##   15000-19999   3317    2    2    1    0    0    0    0 2181 1179  886  506
##   20000-24999   3586    5    3    1    0    0    0    0 2814 1442 1112  634
##   25000-29999   3257    6    3    2    0    0    0    0 2948 1595 1295  671
##   30000-34999   3115    9    1    2    0    0    0    1 3059 1611 1463  730
##   35000-39999   2575   15    3    1    1    0    0    0 2689 1572 1401  715
##   40000-49999   3656   18    2    2    2    1    0    0 4333 2622 2364 1207
##   50000-59999   2804   13    2    0    1    0    0    0 4155 2641 2705 1262
##   60000-74999   2623    9    4    3    0    0    0    0 4719 3448 3869 1654
##   75000-99999   2396    6    4    3    0    1    1    0 5799 4837 5890 2442
##   100000-149999 1309    7    3    2    0    0    0    0 3754 3379 4685 1831
##   150000+        723    8    1    0    1    0    0    0 2519 2299 3577 1492
##                
##                    6    7    8    9
##   < 5000          71   49   14    5
##   5000-7499       56   30    7    5
##   7500-9999       75   15    8    2
##   10000-12499    131   42   18    6
##   12500-14999    110   38   17   12
##   15000-19999    185   77   28   12
##   20000-24999    274   88   42   16
##   25000-29999    296   92   41   23
##   30000-34999    296  118   40   18
##   35000-39999    300  106   41   14
##   40000-49999    450  178   61   23
##   50000-59999    509  153   59   27
##   60000-74999    550  186   69   31
##   75000-99999    778  233   85   32
##   100000-149999  578  151   43   26
##   150000+        441  101   36   13
chisq.test(chart2)
## Warning in chisq.test(chart2): Chi-squared approximation may be incorrect
## 
##  Pearson's Chi-squared test
## 
## data:  chart2
## X-squared = 21089, df = 225, p-value < 2.2e-16

RELATIONSHIP SIGNFICANCE ANALYSIS (Pearson’s Chi-Squared Test):

As the p-value (<2.2e-16) is less than the .05 significance level, and the chi-squared value is notably large (21089), we reject the null hypothesis and conclude that the two variables, family income and household size (1-16), are in fact dependent and have a significant relationship.

Impact of demographic variables on time usage

Code Setup

knitr::opts_chunk$set(echo = TRUE, eval = TRUE, fig.width = 9, fig.height = 5.5, 
                      out.width = "90%", fig.align = "center", cache = T) 

Data Manipulation

The ATUS data is grouped into three separate datasets–atusact, atuscps, and atusresp–that contain information on peoples’ time usage, demographics, and personal information respectively. We will begin by bringing in the atusact dataset and ungrouping it so that we can start manipulating it as we wish.

time_data = ungroup(atusact)
glimpse(time_data)
## Rows: 2,094,140
## Columns: 3
## $ tucaseid <dbl> 2.00301e+13, 2.00301e+13, 2.00301e+13, 2.00301e+13, 2.00301e+~
## $ tiercode <int> 10101, 10201, 110101, 120303, 130124, 10101, 10201, 30101, 11~
## $ dur      <int> 870, 40, 5, 325, 200, 620, 60, 60, 90, 530, 60, 20, 560, 80, ~

In its original state, atusact contains tiercodes with three, two digit pairs. In this combined format the tiercodes provide very detailed information into the exact activities the individual is performing for whatever length of time; however, what we really want to examine is the overall time various people spend on certain categories of tasks that encompass a broad range of related activities. Therefore, what we need to do is isolate the two digit pairings in the tiercodes to gain a better view of this. As you can see, this simply requires some arithmetic to separate out the various tiers.

time_data2 = time_data %>% 
  group_by(tiercode) %>%
  summarise(tucaseid = tucaseid,
            tier1_2Code = tiercode %/% 100,
            tier1 = tiercode %/% 10000,
            tier2 = (tiercode %% 10000) %/% 100, 
            tier3 = (tiercode %% 10000) %% 100,
            dur = dur) %>%
  inner_join(atuscps) %>% 
  select(tiercode, tier1, dur, region, state, famincome, edu) %>%
  drop_na()
## `summarise()` has grouped output by 'tiercode'. You can override using the `.groups` argument.
## Joining, by = "tucaseid"
time_data2

Additionally, as shown above, we want to join the atusact dataset with atuscps so that we can begin analyzing the relationships between peoples’ time usage and various demographics.

Family Income

 

First, we will look into the impact of family income on time usage.

time_data3 = time_data2 %>%
  group_by(tier1, famincome) %>%
  mutate(TDur = sum(dur)) %>% 
  select(famincome, tier1, TDur) %>% 
  distinct() %>% 
  arrange(tier1, TDur)
time_data3

Geographic Region

 

Next, we will examine the impact of region on time usage

time_data4 = time_data2 %>%
  group_by(tier1, region) %>%
  mutate(RDur = sum(dur)) %>% 
  select(region, tier1, RDur) %>% 
  distinct() %>% 
  arrange(tier1, RDur)
time_data4

Educational Attainment

 

And finally, we will look into the impact of educational attainment on time usage

time_data5 = time_data2 %>%
  group_by(tier1, edu) %>%
  mutate(EDur = sum(dur)) %>% 
  select(edu, tier1, EDur) %>% 
  distinct() %>% 
  arrange(tier1, EDur)
time_data5

Generating Graphics

For each of our demographic variables (famincome, region, and edu) we will create a stacked bar graph that displays the percentage of the time that each subgroup of the demographic spends on certain types of activities (i.e. Personal Care)

time_data3$tier1 = as.factor(time_data3$tier1)
colourCount = length(unique(time_data3$tier1))
getPalette = colorRampPalette(brewer.pal(12, "Paired"))
ggplot(data = time_data3, mapping = aes(x = famincome, y = TDur, 
                                                      fill = tier1)) +
  labs(title = "The Impact of Income on Time Usage Distribution", 
       subtitle = "Data from the 2003-2016 American Time Usage Survey", 
       y = "Time Usage (% of Total)", x = "Family Income") +
  scale_fill_brewer(palette = "Set1") +
  geom_col(position = "fill") + 
  scale_y_continuous(breaks = seq(0.0, 1.0, 0.1), labels = function(x){
    paste0(x * 100, '%')}) +
  theme_bw()  +
  scale_fill_manual(name = "Category", labels = c("Personal Care", "Household",
                    "Caring for HH Members", "Caring for NonHH Members", "Work",
                    "Education", "Shopping", "Professional Serv", 
                    "Household Serv", "Gov Serv & Civic Obligations", 
                    "Eating & Drinking", "Socializing & Leisure", 
                    "Exercise & Recreation", "Religious & Spiritual",
                    "Volunteering", "Telephone Calls", "Traveling"), 
                    values = getPalette(colourCount)) + 
  coord_flip() 
## Scale for 'fill' is already present. Adding another scale for 'fill', which
## will replace the existing scale.

Upon cursory inspection, it appears that as family income increases, people tend to spend less time on Personal Care (includes sleeping) and Socializing & Leisure related activities, while spending more time Traveling, on Exercise & Recreation, and on Work related activities.

time_data4$tier1 = as.factor(time_data4$tier1)
ggplot(data = time_data4, mapping = aes(x = region, y = RDur, fill = tier1)) +
  labs(title = "The Impact of Region on Time Usage Distribution", 
       subtitle = "Data from the 2003-2016 American Time Usage Survey", 
       y = "Time Usage (% of Total)", x = "Region") +
  geom_col(position = "fill") +
  scale_y_continuous(breaks = seq(0.0, 1.0, 0.1), labels = function(x){
    paste0(x * 100, '%')}) +
  theme_bw()  +
  scale_fill_manual(name = "Category", labels = c("Personal Care", "Household",
                    "Caring for HH Members", "Caring for NonHH Members", "Work",
                    "Education", "Shopping", "Professional Serv",
                    "Household Serv", "Gov Serv & Civic Obligations",
                    "Eating & Drinking", "Socializing & Leisure",
                    "Exercise & Recreation", "Religious & Spiritual",
                    "Volunteering", "Telephone Calls", "Traveling"),
                    values = getPalette(colourCount)) +
  coord_flip()

Looking at the resulting bar graph, it appears that region has little impact on time usage as the percentages of time spend on different types of activities across various regions of the United States line up almost perfectly. Although there are some slight deviations, for the most part it seems that region does not influence time usage significantly.

time_data5$tier1 = as.factor(time_data5$tier1)
ggplot(data = time_data5, mapping = aes(x = edu, y = EDur, fill = tier1)) +
  labs(title = "The Impact of Educational Attainment on Time Usage Distribution",
       subtitle = "Data from the 2003-2016 American Time Usage Survey", 
       y = "Time Usage (% of Total)", x = "Educational Attainment") +
  geom_col(position = "fill") + 
  scale_y_continuous(breaks = seq(0.0, 1.0, 0.1), labels = function(x){
    paste0(x * 100, '%')}) +
  theme_bw()  +
  scale_fill_manual(name = "Category", labels = c("Personal Care", "Household",
                    "Caring for HH Members", "Caring for NonHH Members", "Work",
                    "Education", "Shopping", "Professional Serv",
                    "Household Serv", "Gov Serv & Civic Obligations",
                    "Eating & Drinking", "Socializing & Leisure",
                    "Exercise & Recreation", "Religious & Spiritual", 
                    "Volunteering", "Telephone Calls", "Traveling"), 
                    values = getPalette(colourCount)) +
  coord_flip()

As one might expect, the trends in the resulting bar graph seem to reflect those that appear in the graph for family income. It is well known that there is a strong correlation between these two variables so this is not that surprising. Most notable among the differences in the visual is that it appears that as educational attainment increases, people spend more time on Work and less time on Socializing & Leisure as well as on Personal Care. We saw these same exact trends in the graph for family income if you recall the result of the earlier code chunk.

Analyzing Relationship Strength/Correlation

Now, we will fit and test various models on the datasets we have created to determine whether these demographics truly have a significant impact on the time usage of people in the ATUS. We will try a number of models and examine the results.

Linear Regression

 

We will start out by conducting some basic tests for linear regression.

mod1 = lm(TDur ~ famincome, data = time_data3) 
summary(mod1)
## 
## Call:
## lm(formula = TDur ~ famincome, data = time_data3)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1885303  -797445  -378265  -101556 10611074 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)   
## (Intercept)              357747     413991   0.864  0.38832   
## famincome5000-7499       -66123     585472  -0.113  0.91017   
## famincome7500-9999        13159     585472   0.022  0.98209   
## famincome10000-12499     132774     585472   0.227  0.82077   
## famincome12500-14999      80370     585472   0.137  0.89092   
## famincome15000-19999     345194     585472   0.590  0.55598   
## famincome20000-24999     483598     585472   0.826  0.40957   
## famincome25000-29999     501555     585472   0.857  0.39243   
## famincome30000-34999     521621     585472   0.891  0.37380   
## famincome35000-39999     434728     585472   0.743  0.45845   
## famincome40000-49999     895620     585472   1.530  0.12732   
## famincome50000-59999     846636     585472   1.446  0.14938   
## famincome60000-74999    1085063     585472   1.853  0.06499 . 
## famincome75000-99999    1533661     585472   2.620  0.00933 **
## famincome100000-149999   966683     585472   1.651  0.09994 . 
## famincome150000+         583459     585472   0.997  0.31992   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1707000 on 256 degrees of freedom
## Multiple R-squared:  0.06503,    Adjusted R-squared:  0.01024 
## F-statistic: 1.187 on 15 and 256 DF,  p-value: 0.2818

Somewhat unexpectedly based on our observations of the bar graph, it appears that family income is not as influential as we thought on peoples’ time usage. While the higher income brackets do show relatively low p-values, the only factor level that produces a p-value less than .05 is ‘famincome75000-99999’.

mod2 = lm(RDur ~ region, data = time_data4)
summary(mod2)
## 
## Call:
## lm(formula = RDur ~ region, data = time_data4)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -5086741 -3044299 -2190062  -355643 30292027 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept)    2423961    1563676   1.550    0.126
## regionmidwest  1011083    2211372   0.457    0.649
## regionsouth    2683688    2211372   1.214    0.229
## regionwest      691334    2211372   0.313    0.756
## 
## Residual standard error: 6447000 on 64 degrees of freedom
## Multiple R-squared:  0.02427,    Adjusted R-squared:  -0.02146 
## F-statistic: 0.5307 on 3 and 64 DF,  p-value: 0.6628

As we expected from the graph we generated, region has little impact on time usage. None of the p-values for the regions are below .05.

mod3 = lm(EDur ~ edu, data = time_data5)
summary(mod3)
## 
## Call:
## lm(formula = EDur ~ edu, data = time_data5)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -3625966 -1980998  -640999     5120 21502217 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)  
## (Intercept)           2195607     906423   2.422   0.0168 *
## eduhs diploma         1443162    1281876   1.126   0.2623  
## edusome college        306506    1281876   0.239   0.8114  
## eduassociate degree   -894382    1281876  -0.698   0.4866  
## edubachelor's degree   588046    1281876   0.459   0.6472  
## edumaster's degree    -989201    1281876  -0.772   0.4417  
## eduprof degree       -1974559    1281876  -1.540   0.1259  
## edudoctoral degree   -1962477    1281876  -1.531   0.1283  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3737000 on 128 degrees of freedom
## Multiple R-squared:  0.09105,    Adjusted R-squared:  0.04134 
## F-statistic: 1.832 on 7 and 128 DF,  p-value: 0.08655

Again, somewhat surprising results based on the appearance of our graph. The p-values for the various educational attainments are even worse than those for family income and none possess a value below .05.

Multi-variable Linear Regression

 

Lets try grouping the dataset observations by tier, famincome, region, and edu to see if we find anything different.

time_data6 = time_data2 %>% 
  group_by(tier1, famincome, region, edu) %>% 
  mutate(Dur = sum(dur)) %>% 
  select(tier1, famincome, region, edu, Dur) %>%
  distinct()
time_data6
mod4 = lm(Dur ~ famincome + region + edu, data = time_data6)
summary(mod4)
## 
## Call:
## lm(formula = Dur ~ famincome + region + edu, data = time_data6)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -102071  -31548  -13718    4322 1128166 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              8427.54    4231.86   1.991 0.046465 *  
## famincome5000-7499      -3816.35    4879.66  -0.782 0.434183    
## famincome7500-9999         55.08    4818.25   0.011 0.990879    
## famincome10000-12499     4193.11    4808.21   0.872 0.383194    
## famincome12500-14999     2620.55    4779.92   0.548 0.583541    
## famincome15000-19999    11843.57    4760.34   2.488 0.012868 *  
## famincome20000-24999    16398.60    4736.64   3.462 0.000539 ***
## famincome25000-29999    16975.94    4734.40   3.586 0.000338 ***
## famincome30000-34999    17600.74    4714.12   3.734 0.000190 ***
## famincome35000-39999    14753.03    4727.50   3.121 0.001811 ** 
## famincome40000-49999    29370.38    4671.28   6.287 3.40e-10 ***
## famincome50000-59999    27975.38    4675.53   5.983 2.28e-09 ***
## famincome60000-74999    35430.09    4673.16   7.582 3.79e-14 ***
## famincome75000-99999    49180.31    4651.05  10.574  < 2e-16 ***
## famincome100000-149999  31659.31    4659.33   6.795 1.16e-11 ***
## famincome150000+        19701.79    4659.55   4.228 2.38e-05 ***
## regionmidwest            8514.51    2351.50   3.621 0.000295 ***
## regionsouth             22214.56    2337.50   9.504  < 2e-16 ***
## regionwest               5875.65    2346.52   2.504 0.012300 *  
## eduhs diploma           22456.89    3200.04   7.018 2.44e-12 ***
## edusome college          4728.70    3204.49   1.476 0.140078    
## eduassociate degree    -14254.41    3212.78  -4.437 9.25e-06 ***
## edubachelor's degree     9207.77    3207.50   2.871 0.004106 ** 
## edumaster's degree     -15479.89    3244.43  -4.771 1.86e-06 ***
## eduprof degree         -32806.64    3439.43  -9.538  < 2e-16 ***
## edudoctoral degree     -32875.96    3429.05  -9.587  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 74240 on 8026 degrees of freedom
## Multiple R-squared:  0.09374,    Adjusted R-squared:  0.09092 
## F-statistic: 33.21 on 25 and 8026 DF,  p-value: < 2.2e-16

The results of this tests are still rather confusing as the p-values don’t line up with the results from earlier tests. The results show both educational attainment and region as having significant impact on time usage despite the results of the earlier tests. Based on this test the most significant predictors are certain levels of each factor. These levels include edudoctoral degree, eduprof degree, regionsouth, and famincome 75000-99999.

Conclusion

Overall, based on the results of the tests and the graphics generated, I think that it is difficult to reach a decisive decision on the impact of the demographic variables Family Income, Region, and Educational Attainment on Time Usage. However, I would say that it seems these three variables do not have that significant of an impact on Time Usage. If I were to rank the impact of the variables it would likely go Family Income, Educational Attainment, and then finally Region.

Investigating factors which influence family income

Here, we investigated the relationship between various demographic variables and income.

# Find average family income by educational attainment
avg_income_by_edu <- atuscps 
avg_income_by_edu <- avg_income_by_edu %>%
  separate(famincome, into=c('income_low','income_high'), sep='-',convert=TRUE)%>%
  mutate(income_low = as.integer(income_low)) %>%
  drop_na() %>%
  mutate(fam_income_mid = (income_high+income_low)/2) %>%
  group_by(education=edu) %>%
  summarise(N = n(),
            avg_income = mean(fam_income_mid)) %>%
  arrange(avg_income)
avg_income_by_edu
#Visualize results
ggplot(data=avg_income_by_edu, aes(x=education, y=avg_income)) +
  geom_bar(stat='identity',  fill = brewer.pal(n=8,name = "Set1")) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  ggtitle("Investigating income by education")

Here we find the expected result of an apparent trend in income as educational attainment increases.

# Investigate by region
# Find average family income by educational attainment
avg_income_by_edu_region <- atuscps 
avg_income_by_edu_region <-avg_income_by_edu_region %>%
  separate(famincome, into=c('income_low','income_high'), sep='-',convert=TRUE)%>%
  mutate(income_low = as.integer(income_low)) %>%
  drop_na() %>%
  mutate(fam_income_mid = (income_high+income_low)/2) %>%
  group_by(region,edu) %>%
  summarise(N = n(),
            avg_income = mean(fam_income_mid)) %>%
  arrange(avg_income)
## `summarise()` has grouped output by 'region'. You can override using the `.groups` argument.
avg_income_by_edu_region %>% head(10)
# Compare regions by income
region_df <- avg_income_by_edu_region %>%
  group_by(region) %>%
  summarise(N=n(),
            avg_income = mean(avg_income)) %>%
  arrange(avg_income)
ggplot(region_df, aes(x=region, y=avg_income, label=avg_income)) +
  geom_bar(stat='identity', fill = brewer.pal(n=4,name = "BuPu")) +
  geom_text(aes(label = signif(avg_income, digits = 3)), nudge_y = -2000, color='DarkBlue') + 
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  ggtitle("Investigating income by region")

Here the results are less clear, although it would appear that the south has relatively low average income.

# Compare income by region and edu
ggplot(avg_income_by_edu_region, aes(x=region, y=avg_income)) +
    geom_col(position='stack') +
    scale_y_continuous(labels = function(x) format(x, scientific = FALSE)) +
    theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
    geom_text(aes(label = signif(avg_income, digits = 3)), angle = 90, 
              nudge_y = -10000, color = 'white') +
    facet_grid(~ edu) +
    scale_fill_brewer(palette="Set1") +
    ggtitle("Investigating income by education and region")

Here we see specific education levels may play a mitigating role in the observed disparity by region.

avg_income_by_race <- atuscps 
avg_income_by_race <-avg_income_by_race %>%
  separate(famincome, into=c('income_low','income_high'), sep='-',convert=TRUE)%>%
  mutate(income_low = as.integer(income_low)) %>%
  drop_na() %>%
  mutate(fam_income_mid = (income_high+income_low)/2) %>%
  group_by(race,region) %>%
  summarise(N = n(),
            avg_income = mean(fam_income_mid)) %>%
  arrange(avg_income)
## `summarise()` has grouped output by 'race'. You can override using the `.groups` argument.
avg_income_by_race %>% head(10)
ggplot(avg_income_by_race, aes(x=region, y=avg_income)) +
    geom_bar(position='stack', stat='identity') +
    scale_fill_manual(palette = "Set1") +
    scale_y_continuous(labels = function(x) format(x, scientific = FALSE)) +
    theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
    geom_text(aes(label = signif(avg_income, digits = 3)), angle = 90, 
              nudge_y = -5500, color = 'white') +
    facet_grid(~ race) +
    ggtitle("Investigating income by race and region")

The results here show a more nuanced picture. The effects of region on income are mitigated by race differently.

avg_income_by_race_edu <- atuscps 
avg_income_by_race_edu <-avg_income_by_race_edu %>%
  separate(famincome, into=c('income_low','income_high'), sep='-',convert=TRUE)%>%
  mutate(income_low = as.integer(income_low)) %>%
  drop_na() %>%
  mutate(fam_income_mid = (income_high+income_low)/2) %>%
  group_by(race,edu) %>%
  summarise(N = n(),
            avg_income = mean(fam_income_mid)) %>%
  arrange(avg_income)
## `summarise()` has grouped output by 'race'. You can override using the `.groups` argument.
avg_income_by_race_edu
ggplot(avg_income_by_race_edu, aes(x=edu, y=avg_income)) +
    geom_col() +
    scale_y_continuous(labels = function(x) format(x, scientific = FALSE)) +
    scale_fill_brewer(palette="Set1") +
    theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
    geom_text(aes(label = signif(avg_income, digits = 3)), angle = 90, 
              nudge_y = -10000, color = 'white') +
    facet_grid(~ race) +
    ggtitle("Investigating income by race and education")

The generally linear appearance of the graph is consistent with the above findings excluding the “Other” categor which has an unusal

Analysis

# Put to the test
atuscps_mid_income <- atuscps 
atuscps_mid_income <- atuscps_mid_income %>%
  separate(famincome, into=c('income_low','income_high'), sep='-',convert=TRUE)%>%
  mutate(income_low = as.integer(income_low)) %>%
  drop_na() %>%
  mutate(fam_income_mid = (income_high+income_low)/2)
test_stats <- lm(fam_income_mid ~ edu + region + race, data=atuscps_mid_income)

summary(test_stats)
## 
## Call:
## lm(formula = fam_income_mid ~ edu + region + race, data = atuscps_mid_income)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -80620 -23129  -6326  16812  99352 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           41965.8      275.7 152.241  < 2e-16 ***
## eduhs diploma          5307.8      255.6  20.770  < 2e-16 ***
## edusome college       13320.0      276.7  48.144  < 2e-16 ***
## eduassociate degree   17348.4      329.5  52.657  < 2e-16 ***
## edubachelor's degree  31020.2      275.1 112.758  < 2e-16 ***
## edumaster's degree    37903.4      353.5 107.211  < 2e-16 ***
## eduprof degree        41615.5      799.1  52.078  < 2e-16 ***
## edudoctoral degree    42897.2      733.4  58.492  < 2e-16 ***
## regionmidwest         -2299.0      254.6  -9.031  < 2e-16 ***
## regionsouth           -3959.9      238.4 -16.610  < 2e-16 ***
## regionwest            -2048.5      262.5  -7.803 6.08e-15 ***
## raceBlack only       -12358.8      240.7 -51.341  < 2e-16 ***
## raceOther             -5780.7      536.0 -10.784  < 2e-16 ***
## raceAsian only         4506.8      476.8   9.451  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 31370 on 152028 degrees of freedom
## Multiple R-squared:  0.1716, Adjusted R-squared:  0.1715 
## F-statistic:  2422 on 13 and 152028 DF,  p-value: < 2.2e-16

Here we can see the initial hypothesis was confirmed ane race, education, and region are all significant predictors of family income.

Exploring the relationship bewteen Race and Income

##Background Visualizations

Racial Composition of U.S.

The following bar graph represents the numbers of people of each race in the U.S. population. This provides a frame of reference for the U.S. population composition when viewing later data visualizations.

National Visualization

race_plot <- ggplot(data = atuscps, aes(x = race, fill = race)) +
  geom_bar() +
  scale_fill_brewer(palette = "Set1") +
  labs(title = "Racial Composition of the U.S.", x = NULL, y = NULL, fill = "Race") +
  labs(title = "Regional Racial Compositions in the U.S.", x = "Race", y = "Count") +
  theme(
    plot.title = element_text(hjust = 0.5)
    ) 
race_plot

Regional Visualization The regional racial distributions seem to be similar to the national one. From this visualization, it can be concluded that the most populated region of the U.S. is the South and the least is the Northeast.

race_plot2 <- ggplot(data = atuscps, aes(x = race, fill = race)) +
  geom_bar() +
  scale_fill_brewer(palette = "Set1") +
  labs(title = "Racial Composition of the U.S.", x = NULL, y = NULL, fill = "Race") +
  labs(title = "Regional Racial Compositions in the U.S.", x = "Race", y = "Count") +
  theme(
    plot.title = element_text(hjust = 0.5)
    ) +
  facet_wrap(~ region)
race_plot2

Subcategories of Race

Other demographic variables, such as citizenship status and country of origin, are relevant to race and may act as subcategories of race.

Citizenship Status Across Race

National Visualization Surprisingly, there seems to be a greater percentage of black people who hold citizenship than that of white people. This may be due to greater immigration from European countries. It is clear that Asians are most likely to not hold citizenship status among the racial categories studided.

citizenship_plot <- ggplot(data = atuscps, mapping = aes(x = race, fill = citizen)) +
  geom_bar(position = "fill") +
  scale_fill_brewer(palette = "Set1") +
  labs(title = "U.S. Citizenship Status Across Races", x = "Race", y = "Count", fill = "U.S. Citizen") +
  theme(
    plot.title = element_text(hjust = 0.5)
  )
citizenship_plot

Regional Visualization Aside from Asians, every other racial group consistently has at least 87.5% citizenship in each region. In the west, a greater proportion of Asians are citizens compared to those of the three other regions.

citizenship_plot2 <- ggplot(data = atuscps, mapping = aes(x = race, fill = citizen)) +
  geom_bar(position = "fill") +
  scale_fill_brewer(palette = "Set1") +
  labs(title = "U.S. Citizenship Status Across Races by Region", x = "Race", y = "Count", fill = "U.S. Citizen") +
  theme(
    plot.title = element_text(hjust = 0.5)
  ) +
  facet_wrap(~ region)
citizenship_plot2

Country of Origin

In the atuscps dataset, the variable listing each respondent’s country of origin is binary; repsondents were asked whether or not their country of origin is the U.S. Both the national and regional visualizations for this variable seem to correspond strongly with the U.S. citizenship status visualizations in the previous section.

National Visualization

country_plot <- ggplot(data = atuscps, mapping = aes(x = race, fill = country_born)) +
  geom_bar(position = "fill") +
  scale_fill_brewer(palette = "Set1") +
  labs(title = "Countries of Origin Across Races", x = "Race", y = "Count", fill = "Country of Origin") +
  theme(
    plot.title = element_text(hjust = 0.5)
  )
country_plot

Regional Visualization

country_plot2 <- ggplot(data = atuscps, mapping = aes(x = race, fill = country_born)) +
  geom_bar(position = "fill") +
  scale_fill_brewer(palette = "Set1") +
  labs(title = "Countries of Origin Across Races by Region", x = "Race", y = "Count", fill = "Country of Origin") +
  theme(
    plot.title = element_text(hjust = 0.5)
  ) +
  facet_wrap(~ region)
country_plot2

Race vs. Income

To visualize the possible relationship between race and income, three representations of income were used : family income brackets, average family income, and the ratio of the top 1% to the bottom 99%.

Observations with missing values have been removed from this dataset to better visualize the income graphs

income <- atuscps %>% 
  na.omit()
income

Family Income Brackets

Race As expected, whites and Asians are increasingly larger proportions of income brackets as they ascend. Conversely, blacks and other racial groups are decreasingly smaller proportions of income brackets as they ascend. This income disparity could be attributed to economic and social inequality.

race_income <- ggplot(data = income, mapping = aes(x = famincome, fill = race)) +
  geom_bar(position = "fill") +
  scale_fill_brewer(palette = "Set1") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(title = "Racial Breakdown of Each Income Bracket in the U.S.", x = "Family Income Bracket", y = "Count", fill = "Race") +
  theme(
    plot.title = element_text(hjust = 0.5)
  )
race_income

Regional Differences in Income To determine whether the regional income distributions differed from those of the national visualization, the facet_wrap() function was used. Again, the regional trends are similar to those of the national one.

race_income2 <- ggplot(data = income, mapping = aes(x = famincome, fill = race)) +
  geom_bar(position = "fill") +
  scale_fill_brewer(palette = "Set1") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(title = "Racial Breakdown of Each Income Bracket in the U.S. by Region", x = "Family Income Bracket", y = "Count", fill = "Race") +
  theme(
    plot.title = element_text(hjust = 0.5)
  ) +
  facet_wrap(~ region)
race_income2

U.S. Citizenship While the proportion of non-citizen blacks in each income bracket remains fairly consistent, the proportion of blacks with U.S. citizenship decreases as the income brackets increase. For whites, the proportion of people decreases for non-citizens and increases for citizens as the income brackets increase. The other racial groups seem to maintain the same distributions regardless of citizenship status. Although this requires further statistical analysis, these results may indicate that citizenship is a significant variable in predicting income.

race_income3 <- ggplot(data = income, mapping = aes(x = famincome, fill = race)) +
  geom_bar(position = "fill") +
  scale_fill_brewer(palette = "Set1") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(title = "Racial Breakdown of Each Income Bracket in the U.S. by Citizenship", x = "Family Income Bracket", y = "Count", fill = "Race") +
  theme(
    plot.title = element_text(hjust = 0.5)
  ) +
  facet_wrap(~ citizen)
race_income3

Country of Origin The proportion of whites born outside of the U.S. decreases as the income bracket increases, which corresponds to the trend seen for non-citizen whites in the citizenship visualization. The proportion of blacks born outside of the U.S. remains fairly consistent across each income bracket, which again reflects the citizenship distribution. Distributions for the two other racial groups are similar to those of the national ones.

race_income4 <- ggplot(data = income, mapping = aes(x = famincome, fill = race)) +
  geom_bar(position = "fill") +
  scale_fill_brewer(palette = "Set1") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(title = "Racial Breakdown of Each Income Bracket in the U.S. by Country of Origin", x = "Family Income Bracket", y = "Count", fill = "Race") +
  theme(
    plot.title = element_text(hjust = 0.5)
  ) +
  facet_wrap(~ country_born)
race_income4

Average Family Income

The dataset below calculates the average family income for each race within each region.

avg_income_by_race <- atuscps 
avg_income_by_race <-avg_income_by_race %>%
  separate(famincome, into=c('income_low','income_high'), sep='-',convert=TRUE)%>%
  mutate(income_low = as.integer(income_low)) %>%
  drop_na() %>%
  mutate(fam_income_mid = (income_high+income_low)/2) %>%
  group_by(race) %>%
  summarise(N = n(),
            avg_income = mean(fam_income_mid)) %>%
  arrange(avg_income)
## Warning: Expected 2 pieces. Missing pieces filled with `NA` in 15474 rows [205,
## 211, 248, 292, 311, 377, 400, 431, 504, 535, 554, 555, 581, 596, 611, 616, 655,
## 662, 710, 766, ...].
## Warning in mask$eval_all_mutate(quo): NAs introduced by coercion
avg_income_by_race

Race Surprisingly, Asians have the highest average family income, followed by whites, other, then blacks.

avg_income_plot <- ggplot(data = avg_income_by_race, mapping = aes(x = race, y = avg_income)) +
  geom_col() +
  scale_y_continuous(labels = function(x) format(x, scientific = FALSE)) +
  scale_fill_brewer(palette = "Set1") +
  labs(title = "Average U.S. Family Income by Race", x = "Race", y = "Average Family Income") +
  theme(
    plot.title = element_text(hjust = 0.5)
  ) 
avg_income_plot

Regional Differences in Income The average family income maintains the same distribution for each region as the general racial one in the previous visualization.

avg_income_by_race2 <- atuscps %>%
  separate(famincome, into=c('income_low','income_high'), sep='-',convert=TRUE)%>%
  mutate(income_low = as.integer(income_low)) %>%
  drop_na() %>%
  mutate(fam_income_mid = (income_high+income_low)/2) %>%
  group_by(race, region) %>%
  summarise(N = n(),
            avg_income = mean(fam_income_mid)) %>%
  arrange(avg_income)
## Warning: Expected 2 pieces. Missing pieces filled with `NA` in 15474 rows [205,
## 211, 248, 292, 311, 377, 400, 431, 504, 535, 554, 555, 581, 596, 611, 616, 655,
## 662, 710, 766, ...].
## Warning in mask$eval_all_mutate(quo): NAs introduced by coercion
## `summarise()` has grouped output by 'race'. You can override using the `.groups` argument.
avg_income_by_race2
avg_income_plot2 <- ggplot(data = avg_income_by_race2, mapping = aes(x = race, y = avg_income)) +
  geom_col() +
  scale_y_continuous(labels = function(x) format(x, scientific = FALSE)) +
  scale_fill_brewer(palette = "Set1") +
  labs(title = "Regional Average U.S. Family Income by Race", x = "Race", y = "Average Family Income") +
  theme(
    plot.title = element_text(hjust = 0.5)
  ) +
  facet_wrap(~ region) +
  geom_text(aes(label = signif(avg_income, digits = 3)), angle = 90, 
              nudge_y = -15000, color = 'white')
avg_income_plot2

U.S. Citizenship For U.S. citizens, the distribution was the same as the ones for the previous two visualizations. Surprisingly, average family income was about $40,000 for all non-citizens except for Asians. Asian non-citizens had a considerably higher average family income than these other racial groups.

avg_income_by_race3 <- atuscps %>%
  separate(famincome, into=c('income_low','income_high'), sep='-',convert=TRUE)%>%
  mutate(income_low = as.integer(income_low)) %>%
  drop_na() %>%
  mutate(fam_income_mid = (income_high+income_low)/2) %>%
  group_by(race, citizen) %>%
  summarise(N = n(),
            avg_income = mean(fam_income_mid)) %>%
  arrange(avg_income)
## Warning: Expected 2 pieces. Missing pieces filled with `NA` in 15474 rows [205,
## 211, 248, 292, 311, 377, 400, 431, 504, 535, 554, 555, 581, 596, 611, 616, 655,
## 662, 710, 766, ...].
## Warning in mask$eval_all_mutate(quo): NAs introduced by coercion
## `summarise()` has grouped output by 'race'. You can override using the `.groups` argument.
avg_income_by_race3
avg_income_plot3 <- ggplot(data = avg_income_by_race3, mapping = aes(x = race, y = avg_income)) +
  geom_col() +
  scale_y_continuous(labels = function(x) format(x, scientific = FALSE)) +
  scale_fill_brewer(palette = "Set1") +
  labs(title = "Average U.S. Family Income by Citizenship Status Across Race", x = "Race", y = "Average Family Income") +
  theme(
    plot.title = element_text(hjust = 0.5)
  ) +
  facet_wrap(~ citizen) +
  geom_text(aes(label = signif(avg_income, digits = 3)), angle = 90, 
              nudge_y = -15000, color = 'white')
avg_income_plot3

Country of Origin The country of origin trend mirrors that of the one for citizenship. This makes sense, as respondents who reported that they were born outside of the U.S. most likely do not hold U.S. citizenship.

avg_income_by_race4 <- atuscps %>%
  separate(famincome, into=c('income_low','income_high'), sep='-',convert=TRUE)%>%
  mutate(income_low = as.integer(income_low)) %>%
  drop_na() %>%
  mutate(fam_income_mid = (income_high+income_low)/2) %>%
  group_by(race, country_born) %>%
  summarise(N = n(),
            avg_income = mean(fam_income_mid)) %>%
  arrange(avg_income)
## Warning: Expected 2 pieces. Missing pieces filled with `NA` in 15474 rows [205,
## 211, 248, 292, 311, 377, 400, 431, 504, 535, 554, 555, 581, 596, 611, 616, 655,
## 662, 710, 766, ...].
## Warning in mask$eval_all_mutate(quo): NAs introduced by coercion
## `summarise()` has grouped output by 'race'. You can override using the `.groups` argument.
avg_income_by_race4
avg_income_plot4 <- ggplot(data = avg_income_by_race4, mapping = aes(x = race, y = avg_income)) +
  geom_col() +
  scale_y_continuous(labels = function(x) format(x, scientific = FALSE)) +
  scale_fill_brewer(palette = "Set1") +
  labs(title = "Average U.S. Family Income by Citizenship Status Across Race", x = "Race", y = "Average Family Income") +
  theme(
    plot.title = element_text(hjust = 0.5)
  ) +
  facet_wrap(~ country_born) +
  geom_text(aes(label = signif(avg_income, digits = 3)), angle = 90, 
              nudge_y = -15000, color = 'white')
avg_income_plot4

Statistical Analysis

First, simple linear regresions were conducted to analyze the individual demographic varaibles from the study. Then, they were combined to form a multiple regression to create a more complex and realistic relationship to income.

income2 <- atuscps %>%
  separate(famincome, into=c('income_low','income_high'), sep='-',convert=TRUE)%>%
  mutate(income_low = as.integer(income_low)) %>%
  drop_na() %>%
  mutate(fam_income_mid = (income_high+income_low)/2) 
## Warning: Expected 2 pieces. Missing pieces filled with `NA` in 15474 rows [205,
## 211, 248, 292, 311, 377, 400, 431, 504, 535, 554, 555, 581, 596, 611, 616, 655,
## 662, 710, 766, ...].
## Warning in mask$eval_all_mutate(quo): NAs introduced by coercion
income2
mod1 <- lm(fam_income_mid ~ race, data = income2)
summary(mod1)
## 
## Call:
## lm(formula = fam_income_mid ~ race, data = income2)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -61183 -27646 -10146  27564  85064 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     55145.11      96.74  570.06   <2e-16 ***
## raceBlack only -15210.02     255.65  -59.50   <2e-16 ***
## raceOther       -8671.35     578.68  -14.98   <2e-16 ***
## raceAsian only  12286.94     511.35   24.03   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 33970 on 152038 degrees of freedom
## Multiple R-squared:  0.02868,    Adjusted R-squared:  0.02866 
## F-statistic:  1496 on 3 and 152038 DF,  p-value: < 2.2e-16
mod2 <- lm(fam_income_mid ~ citizen, data = income2)
summary(mod2)
## 
## Call:
## lm(formula = fam_income_mid ~ citizen, data = income2)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -47881 -26631  -9131  33369  82514 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  42485.2      320.5  132.56   <2e-16 ***
## citizenyes   11644.9      333.3   34.94   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 34330 on 152040 degrees of freedom
## Multiple R-squared:  0.007964,   Adjusted R-squared:  0.007957 
## F-statistic:  1221 on 1 and 152040 DF,  p-value: < 2.2e-16
mod3 <- lm(fam_income_mid ~ country_born, data = income2)
summary(mod3)
## 
## Call:
## lm(formula = fam_income_mid ~ country_born, data = income2)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -47988 -26738  -9238  33262  77475 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        54237.79      95.48  568.06   <2e-16 ***
## country_bornnon-US -6712.89     249.07  -26.95   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 34390 on 152040 degrees of freedom
## Multiple R-squared:  0.004755,   Adjusted R-squared:  0.004749 
## F-statistic: 726.4 on 1 and 152040 DF,  p-value: < 2.2e-16

To represent the more complex relationships between all of these demographic variables and income, I used multiple regression

mod4 <- lm(fam_income_mid ~ race + citizen + country_born,  data = income2)
summary(mod4)
## 
## Call:
## lm(formula = fam_income_mid ~ race + citizen + country_born, 
##     data = income2)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -69055 -27193  -8443  26557  99441 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         46051.0      462.2   99.63   <2e-16 ***
## raceBlack only     -15564.4      254.0  -61.28   <2e-16 ***
## raceOther           -8503.4      574.7  -14.80   <2e-16 ***
## raceAsian only      18797.7      535.9   35.08   <2e-16 ***
## citizenyes          10455.4      451.8   23.14   <2e-16 ***
## country_bornnon-US  -4927.7      348.4  -14.15   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 33730 on 152036 degrees of freedom
## Multiple R-squared:  0.04217,    Adjusted R-squared:  0.04214 
## F-statistic:  1339 on 5 and 152036 DF,  p-value: < 2.2e-16

Conclusion

Based on the results of the linear regressions conducted, all of the demographic variables studied are significant due to their small p-values (<2e-16). Unfortunately, this means that the impacts and importance of each variable cannot be distinguished since they returned the same p-values and are all concluded to be significant. In conclusion, race, citizenship, and country of origin are all significant predictors of family income.

Appendix

Contributions

Mahima

  • Exploring the relationship between household size and Income
  • Exploring the relationship between home type and family income

Zach

  • Impact of demographic variables on time usage

James

  • Investigating factors which influence family income

Lihn

  • Exploring the relationship between race and income

Part 1 Data sets

  • Mahima - drug_use
  • Zach - bad_drivers
  • James - endorsements_2020
  • Lihn - hiphop_cand_lyrics